---
title: Grounding Models
description: Models that support click prediction with ComputerAgent.predict_click()
---

These models specialize in UI element grounding and click prediction. They can identify precise coordinates for UI elements based on natural language descriptions, but cannot perform autonomous task planning.

Use `ComputerAgent.predict_click()` to get coordinates for specific UI elements; see the usage examples below.

All models that support `ComputerAgent.run()` also support `ComputerAgent.predict_click()`. See [All‑in‑one CUAs](./computer-use-agents).

## All-in-one CUAs

### Anthropic CUAs

- Claude 4.1: `claude-opus-4-1-20250805`
- Claude 4: `claude-opus-4-20250514`, `claude-sonnet-4-20250514`
- Claude 3.7: `claude-3-7-sonnet-20250219`
- Claude 3.5: `claude-3-5-sonnet-20241022`

### OpenAI CUA Preview

- Computer-use-preview: `computer-use-preview`

### UI-TARS 1.5 (Unified VLM with grounding support)

- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B`
- `huggingface/ByteDance-Seed/UI-TARS-1.5-7B` (requires a hosted Text Generation Inference (TGI) endpoint)
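
UI-TARS can serve `predict_click()` directly. A minimal sketch, assuming `computer` is an already-configured Computer tool instance and the 7B weights can be loaded on your hardware:

```python
from agent import ComputerAgent

# Load UI-TARS 1.5 locally via the huggingface-local provider
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])

# Returns (x, y) coordinates for the described element
coords = agent.predict_click("find the browser address bar")
```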

## Specialized Grounding Models

These models are optimized specifically for click prediction and UI element grounding:

### OpenCUA

- `huggingface-local/xlangai/OpenCUA-{7B,32B}`

### GTA1 Family

- `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`

### Holo 1.5 Family

- `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`

### InternVL 3.5 Family

- `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`

### OmniParser (OCR)

OCR-focused set-of-marks model that requires an LLM for click prediction:

- `omniparser` (must be combined with a LiteLLM-compatible vision model)

### Moondream3 (Local Grounding)

Moondream3 is a compact but capable model that can perform UI grounding and click prediction locally.

- `moondream3`
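
A minimal sketch of standalone click prediction with Moondream3 (assumes `computer` is an already-configured Computer tool instance):

```python
from agent import ComputerAgent

agent = ComputerAgent("moondream3", tools=[computer])

# Moondream3 grounds the natural language description to screen coordinates
coords = agent.predict_click("find the settings gear icon")
print(coords)  # e.g., (412, 87)
```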

## Usage Examples

```python
# Using any grounding-capable model for click prediction
from agent import ComputerAgent

# Assumes `computer` is an already-configured Computer tool instance
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])

# Predict coordinates for specific elements
login_coords = agent.predict_click("find the login button")
search_coords = agent.predict_click("locate the search text field")
menu_coords = agent.predict_click("find the hamburger menu icon")

print(f"Login button: {login_coords}")
print(f"Search field: {search_coords}")
print(f"Menu icon: {menu_coords}")
```

```python
# OmniParser only performs OCR / set-of-marks parsing, so it must be
# composed with a vision LLM for predict_click
agent = ComputerAgent("omniparser+anthropic/claude-3-5-sonnet-20241022", tools=[computer])

# Predict click coordinates using composed agent
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}")  # e.g., (450, 320)

# Note: Cannot use omniparser alone for click prediction
# This will raise an error:
# agent = ComputerAgent("omniparser", tools=[computer])
# coords = agent.predict_click("find button")  # Error!
```

```python
# Specialized grounding model: click prediction only
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])

# Predict click coordinates for UI elements
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}")  # e.g., (450, 320)

# Note: GTA1 cannot perform autonomous task planning
# This will raise an error:
# agent.run("Fill out the form and submit it")
```
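
Because specialized grounding models cannot plan, composing one with a planning-capable LLM restores `run()`. A sketch using the same `grounding+planning` model-string syntax as the OmniParser example above (the model pairing here is illustrative):

```python
# GTA1 grounds each click; Claude plans the task steps
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
    tools=[computer],
)

agent.run("Fill out the form and submit it")  # the composed agent supports run()
```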

---

For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents) and [All‑in‑one CUAs](./computer-use-agents).
